Inferring the location of authors from words in their texts
Abstract
For the purposes of computational dialectology or other geographically bound text analysis tasks, texts must be annotated with their or their authors’ location. Many texts are locatable, but most have no explicit annotation of place. This paper describes a series of experiments to determine how positionally annotated microblog posts can be used to learn location-indicating words, which can then be used to locate blog texts and their authors. A Gaussian distribution is used to model the locational qualities of words. We introduce the notion of placeness to describe how locational words are. We find that three choices together give the most useful results: modelling word distributions to account for several locations, and thus several Gaussian distributions per word; defining a filter which picks out words with high placeness based on their local distributional context; and aggregating locational information in a centroid for each text. The results are applied to data in the Swedish language.

1 Text and Geographical Position

Authors write texts in a location, about something in a location (or about the location itself), reside and conduct their business in various locations, and have a background in some location. Some texts are personal, anchored in the here and now, where others are general and not necessarily bound to any context. Texts reflect these facts explicitly or implicitly, through explicit author intention or incidentally. When a text is locational, it may be so because the author mentions some location or because the author is contextually bound to some location. In both cases, the text may or may not have explicit mentions of the context of the author or mention other locations in the text.

For some applications, inferring the location of a text or its author automatically is of interest. We present in this paper how the location of a text can be established from the locational qualities of the terminology used by its author.
Here, we investigate the utility of doing so for two distinct use cases.

Firstly, for detecting regional language usage for the purposes of real-time dialectology. The issue here is to find differences in term usage across locations and to investigate whether terminological variation differs across regions. In this case, the ultimate objective is to collect sizeable text collections from various regions of a linguistic area to establish whether a certain term or turn of phrase is used more or less frequently in some specific region. The task is then to establish where the author of a text is originally from. This has hitherto been investigated by manual inspection of text collections (e.g. Parkvall 2012).

Secondly, for monitoring public opinion of e.g. brands, political issues, or other topics of interest. In this case, the ultimate objective is to find whether there is regional variation in the occurrence of opinionated mentions of the topic or topical target under consideration. The task is then to establish the location where a given text is written or, alternatively, what location the text refers to.

In both cases, the system is presented with a body of text and has the task of assigning a likely location to it. In the former task, the body of text is typically larger and noisier (since authors may refer to locations other than their immediate context); in the latter task, the text may be short and offer little evidence to work from.

Both tasks, that of identifying the location of an author and that of a text, have been addressed by recent experiments with various points of departure: knowledge-based approaches that make use of recorded points of interest in a location, modelling the geographic distribution of topics, or using social network analysis to find additional information about the author. This set of experiments focuses on the text itself and on using distributional semantics to refine the set of terms used for locating a text.
2 Location and words as evidence of locations

Most words contribute little or not at all to positioning text. Some words are dead giveaways: an author may mention a specific location in the text. Frequently, but not always, this is reasonable evidence of position. Some words are less patently locational, but contribute incidentally, such as the name of some establishment or some characteristic feature of a location.

arXiv:1612.06671v1 [cs.CL] 20 Dec 2016

Some locational terms are polysemous; some are unspecific; some are vague. As indicated in Figure 1, the term Falköping unambiguously indicates a town in Southern Sweden, which in turn is a vague term without a clear and well-defined border to other bits of Sweden. The term Södermalm is polysemous and refers to a section of town in several Swedish towns; the term spårvagn (“tram”) is indicative of one of several Swedish towns with tram lines. We call both of these latter types of term polylocational and allow them to contribute to numerous places simultaneously.

Other words contribute variously to the location of a text. Some words are less patently locational than named places, but contribute incidentally, such as the name of some establishment, some characteristic feature of a location, some event which takes place in some location, or some other topic the discussion of which is more typical in one location than in another. We will estimate the placeness of words in these experiments.

Figure 1: Some terms are polylocational

3 Mapping from a continuous to a discrete representation

As in previous experiments, we collect the geographic distribution of word usage by collecting microblog posts from Twitter, some of which carry longitude and latitude. Posts with location information are distributed over a map in what amounts to a continuous representation. The words from posts can be collected and associated with the positions they have been observed in.
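As an illustration of associating words with observed positions, the following minimal Python sketch collects (latitude, longitude) pairs per word and summarises them with a single two-dimensional Gaussian (mean and covariance). It is a sketch under stated assumptions, not the implementation used in these experiments: the whitespace tokenisation, the function names `collect_observations` and `fit_gaussian`, and the three example posts with their coordinates are all invented for illustration.

```python
from collections import defaultdict

import numpy as np


def collect_observations(posts):
    """Map each lower-cased token to the (lat, lon) pairs of the posts it occurs in.

    `posts` is an iterable of (text, lat, lon) tuples; only posts that carry
    coordinates are expected here. Tokenisation by whitespace is an assumption.
    """
    observations = defaultdict(list)
    for text, lat, lon in posts:
        # Count each word once per post, so one long post cannot dominate.
        for token in set(text.lower().split()):
            observations[token].append((lat, lon))
    return observations


def fit_gaussian(points):
    """Return the mean and 2x2 covariance of a word's observed positions."""
    coords = np.asarray(points, dtype=float)
    mean = coords.mean(axis=0)
    # A single observation has no spread; fall back to a zero covariance.
    cov = np.cov(coords, rowvar=False) if len(coords) > 1 else np.zeros((2, 2))
    return mean, cov


# Invented example data: two posts near Göteborg, one near Stockholm.
posts = [
    ("spårvagn till jobbet", 57.70, 11.97),
    ("missade spårvagn igen", 57.72, 11.95),
    ("kaffe i solen", 59.33, 18.06),
]
obs = collect_observations(posts)
mean, cov = fit_gaussian(obs["spårvagn"])
```

A single Gaussian per word is the simplest case; a polylocational word such as spårvagn would in practice need several components, which this sketch does not attempt.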
The first experiments which use training data similar to ours have typically assigned the posts, and thus the words they occur in, directly to some representation of locations: a word which occurs in tweets at [N59.35,E18.11] and [N59.31,E18.05] will have both observations recorded to be in the same city (Cheng et al. 2010, Mahmud et al. 2012). An alternative and later approach, by e.g. Priedhorsky et al. (2014), is to aggregate all observations of a word over a map and assign a named location to the distribution rather than to each observation, deferring the labelling to a point in the analysis where more is known about the term distribution.

Another approach is to model topics as inferred from vocabulary usage in text across their geographical distribution, and then, for each text, to assess the topic and thus its attendant location vis-à-vis the topic model most likely to have generated the text in question (Eisenstein et al. 2010, Yin et al. 2011, Kinsella et al. 2011, Hong et al. 2012). We have found that topic models as implemented are computationally demanding, do not add accuracy to prediction, and have little explanatory value to aid the understanding of localised language use.

In these experiments we will compare using a list of known places with a model where we aggregate the locational information provided by words (and potentially other linguistic items such as constructions) trained on longitude and latitude, either by letting the words vote for place or by averaging the information on a word-by-word basis. The former aggregation strategy, voting, assigns place to the words earlier in the process; the latter, averaging, defers the mapping to place until some analysis has been performed.
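The difference between assigning place early and deferring the mapping can be sketched in a few lines of Python. This is one possible reading of "voting" and "averaging", offered as an assumption rather than as the procedure used in these experiments: the gazetteer `PLACES` with its approximate city-centre coordinates, the equirectangular distance approximation, and the function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical gazetteer: approximate city-centre coordinates as (lat, lon).
PLACES = {
    "Stockholm": (59.33, 18.06),
    "Göteborg": (57.71, 11.97),
    "Malmö": (55.60, 13.00),
}


def nearest_place(lat, lon):
    """Return the gazetteer entry closest to (lat, lon).

    Uses an equirectangular approximation, which is adequate for ranking
    distances within one country but is not a true geodesic distance.
    """
    def dist(place):
        plat, plon = PLACES[place]
        dx = (lon - plon) * math.cos(math.radians((lat + plat) / 2))
        dy = lat - plat
        return math.hypot(dx, dy)

    return min(PLACES, key=dist)


def locate_by_vote(word_positions):
    """Early mapping: each word position is snapped to a place, majority wins."""
    votes = Counter(nearest_place(lat, lon) for lat, lon in word_positions)
    return votes.most_common(1)[0][0]


def locate_by_average(word_positions):
    """Deferred mapping: average the positions first, then snap once to a place."""
    lat = sum(p[0] for p in word_positions) / len(word_positions)
    lon = sum(p[1] for p in word_positions) / len(word_positions)
    return nearest_place(lat, lon)


# The two tweet positions mentioned in the text fall in the same city here.
same_city = nearest_place(59.35, 18.11) == nearest_place(59.31, 18.05)
```

With voting, a single outlying word costs at most one vote; with averaging, it pulls the centroid towards itself, so the two strategies can disagree on texts that mention several distant places.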